Introduction to R and Rstudio

Session - Cleaning data with {dplyr}

Zoë Turner
Cartoon image with the word dplyr: go wrangling above. There are two fluffy characters with the bigger called dplyr being ridden by a smaller character with a hat like a cowboy. The cowboy is rounding up three others called data that look less friendly and are being caught with the cowboy's whip

Artwork by @allison_horst

Wrangling

Is the reshaping or transforming of data into a format which is easier to work with

This is often the largest part of many analyses and data science

A note on tidy data

Tidyverse functions work best with tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.

(Broadly, this means long rather than wide tables)

{dplyr} package

  • {dplyr} is a language for data manipulation
  • Most wrangling puzzles can be solved with knowledge of just a few {dplyr} verbs or functions
  • Many of the concepts of these functions exist in SQL but {dplyr} (and other packages) can extend this further

6 functions/verbs to start with

6 key verbs will help us gain a deeper understanding of our data sets.

Note summarise() can also be spelt summarize()

dplyr::arrange()
dplyr::filter()
dplyr::mutate()
dplyr::group_by()
# followed often by 
dplyr::ungroup()
dplyr::summarise()

Building with steps

These verbs aren’t used independently of each other.

Each can be a step in the code, like a recipe but can also be repeated.

A recipe starts with:

potato then
peel then
slice into medium sized pieces then
boil for 25 minutes then
mash

Recipe as code

The potato is the object in R terms and the steps are verbs or functions

Take a potato then
peel then
slice into medium sized pieces then
boil for 25 minutes then
mash

potato |>
peel() |>
slice(size = "medium" |>
boil(time = 25) |>
mash()

the |> can be replaced with the word ‘then’ in this recipe scenario

Pipe

Shortcut key Ctrl+Shift+m

You might be familiar with the pipe |> which comes from the {magrittr} and is available through the {tidyverse} but the new pipe |> doesn’t require any packages to run

Screenshot of the Tools/Options wizard in the Code tab from the side and Editing at the top. Use native pipe operator option to select is highlighted.

Q1. Which organisation provided the highest number of Mental Health (MH) beds?

arrange()

Reorder rows based on selected variable

beds_data |> 
  arrange(beds_av)

Descending data

We need descending order:

beds_data |> 
  arrange(desc(beds_av))

desc() works for text and numeric variables

Q2. Which 2 organisations provided the highest number of MH beds in September 2018?

  • We’ll use arrange() as before to get the “highest number”
  • But we require only observations with the date “September 2018”

filter()

The expression inside brackets should return TRUE or FALSE. We are choosing rows where this expression is TRUE.

beds_data |> 
  filter(date == "2018-09-01") 

A negative test of equality

To exclude and test where the expression is NOT equal !=

beds_data |> 
  filter(date != "2018-09-01") 

Ordered and filtered

filter() first to reduce the number of rows to apply the next code to

beds_data  |> 
  filter(date == "2018-09-01") |> 
  arrange(desc(beds_av)) 

Find the top 2 organisations

This isn’t a key function but useful and there are many other functions for slice...

beds_data  |> 
  arrange(desc(beds_av)) |> 
  filter(date == "2018-09-01") |> 
  slice_head(n = 2)

Q3. Which organisations had the highest percentage bed occupancy in September 2018?

  • We’ll use arrange() as before to find “highest”
  • We’ll use filter() as before to restrict by date “September 2018”
  • But we don’t have a percentage variable in the data

Create new variables

= in this context is an alias not a test of equality

beds_data |> 
  mutate(perc_occ = occ_av / beds_av) |> 
  filter(date == "2018-09-01") |> 
  arrange(desc(perc_occ)) 

Q4. What was the mean number of beds (for the dataset)?

  • Let’s first look at how we’d produce summary statistics like a mean
  • And then see how this can be applied to groups of data like organisations

summarise()

Collapses a single summary value

beds_data |> 
  summarise(mean_beds = mean(beds_av))

Missing values

We’ll need to remove NA values to get a suitable mean. TRUE can also be T

beds_data |> 
  summarise(mean_beds = mean(beds_av,
                             na.rm = TRUE)) 

Have a go!

Instead of mean() use median()

object |> 
  summarise(new_name = function_name(column_name,
                                     na.rm = ???))

Use a sum() statistic twice

object |> 
  summarise(col_1 = function_name(beds_av,
                                  na.rm = ???),
            col_2 = function_name(occ_av,
                                  na.rm = ???)
)
10:00

Answer for summary statistics

median()

beds_data |> 
  summarise(per_occ = median(beds_av,
                             na.rm = TRUE))
# A tibble: 1 × 1
  per_occ
    <dbl>
1     241

sum()

beds_data |> 
  summarise(total_beds = sum(beds_av, na.rm = TRUE),
            total_occupacy = sum(occ_av, na.rm = TRUE))
# A tibble: 1 × 2
  total_beds total_occupacy
       <dbl>          <dbl>
1     412480         368434

Applying summarise() to groups

  • Now we know how to use summarise() (mean() and also sum())
  • We’ll produce a summary value for each value of date

group_by() - persistent grouping

group_by() does nothing to the output alone.
The change occurs behind the scenes.

beds_data |> 
  group_by(date) 

ungroup()

Cartoon of fuzzy creatures created by Allison Horts with party hats on. Two are together and happy but one is behind holding a present and looking sad. The words.

Seeing ungroup()

The data is unchanged but the metadata changes

beds_data |> 
  group_by(date) 

beds_data |> 
  group_by(date) |> 
  ungroup() 

Break?

Option to take this break before an exercise of after

10:00

Q5. Which organisations have the highest mean % bed occupancy?

  • For each of the organisations group_by()
  • Add 2 summary statistics, total_beds/total_occupancy to summarise() the data
  • mutate() the data to create a percentage using the previous
  • Order to find highest by using arrange()

Hint

beds_data |> 
  group_by() |> 
  summarise() |>
  mutate() |> 
  arrange()
10:00

Solution

beds_data |> 
  group_by(org_name) |> 
  summarise(total_beds = sum(beds_av, na.rm = TRUE),
            total_occupancy = sum(occ_av, na.rm = TRUE)) |> 
  mutate(perc_occ = total_occupancy / total_beds) |> 
  arrange(desc(perc_occ))

End of session